Homework1

Author

Adaeze Obinelo

Code
old <- read.csv("2002data.csv")
new <- read.csv("2022data.csv")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(dplyr)
library(lubridate)
library(ggpubr)
library(broom)
library(AICcmodavg)

Step 1

Code
glimpse(new)
Rows: 57,775
Columns: 20
$ Date                           <chr> "01/01/2022", "01/02/2022", "01/03/2022…
$ Source                         <chr> "AQS", "AQS", "AQS", "AQS", "AQS", "AQS…
$ Site.ID                        <int> 60010007, 60010007, 60010007, 60010007,…
$ POC                            <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ Daily.Mean.PM2.5.Concentration <dbl> 12.7, 13.9, 7.1, 3.7, 4.2, 3.8, 2.3, 6.…
$ UNITS                          <chr> "ug/m3 LC", "ug/m3 LC", "ug/m3 LC", "ug…
$ DAILY_AQI_VALUE                <int> 52, 55, 30, 15, 18, 16, 10, 29, 54, 47,…
$ Site.Name                      <chr> "Livermore", "Livermore", "Livermore", …
$ DAILY_OBS_COUNT                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ PERCENT_COMPLETE               <dbl> 100, 100, 100, 100, 100, 100, 100, 100,…
$ AQS_PARAMETER_CODE             <int> 88101, 88101, 88101, 88101, 88101, 8810…
$ AQS_PARAMETER_DESC             <chr> "PM2.5 - Local Conditions", "PM2.5 - Lo…
$ CBSA_CODE                      <int> 41860, 41860, 41860, 41860, 41860, 4186…
$ CBSA_NAME                      <chr> "San Francisco-Oakland-Hayward, CA", "S…
$ STATE_CODE                     <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
$ STATE                          <chr> "California", "California", "California…
$ COUNTY_CODE                    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ COUNTY                         <chr> "Alameda", "Alameda", "Alameda", "Alame…
$ SITE_LATITUDE                  <dbl> 37.68753, 37.68753, 37.68753, 37.68753,…
$ SITE_LONGITUDE                 <dbl> -121.7842, -121.7842, -121.7842, -121.7…
Code
glimpse(old)
Rows: 15,976
Columns: 20
$ Date                           <chr> "01/05/2002", "01/06/2002", "01/08/2002…
$ Source                         <chr> "AQS", "AQS", "AQS", "AQS", "AQS", "AQS…
$ Site.ID                        <int> 60010007, 60010007, 60010007, 60010007,…
$ POC                            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Daily.Mean.PM2.5.Concentration <dbl> 25.1, 31.6, 21.4, 25.9, 34.5, 41.0, 29.…
$ UNITS                          <chr> "ug/m3 LC", "ug/m3 LC", "ug/m3 LC", "ug…
$ DAILY_AQI_VALUE                <int> 78, 92, 71, 80, 98, 115, 87, 57, 65, 10…
$ Site.Name                      <chr> "Livermore", "Livermore", "Livermore", …
$ DAILY_OBS_COUNT                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ PERCENT_COMPLETE               <dbl> 100, 100, 100, 100, 100, 100, 100, 100,…
$ AQS_PARAMETER_CODE             <int> 88101, 88101, 88101, 88101, 88101, 8810…
$ AQS_PARAMETER_DESC             <chr> "PM2.5 - Local Conditions", "PM2.5 - Lo…
$ CBSA_CODE                      <int> 41860, 41860, 41860, 41860, 41860, 4186…
$ CBSA_NAME                      <chr> "San Francisco-Oakland-Hayward, CA", "S…
$ STATE_CODE                     <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
$ STATE                          <chr> "California", "California", "California…
$ COUNTY_CODE                    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ COUNTY                         <chr> "Alameda", "Alameda", "Alameda", "Alame…
$ SITE_LATITUDE                  <dbl> 37.68753, 37.68753, 37.68753, 37.68753,…
$ SITE_LONGITUDE                 <dbl> -121.7842, -121.7842, -121.7842, -121.7…
Code
head(old)
        Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    UNITS
1 01/05/2002    AQS 60010007   1                           25.1 ug/m3 LC
2 01/06/2002    AQS 60010007   1                           31.6 ug/m3 LC
3 01/08/2002    AQS 60010007   1                           21.4 ug/m3 LC
4 01/11/2002    AQS 60010007   1                           25.9 ug/m3 LC
5 01/14/2002    AQS 60010007   1                           34.5 ug/m3 LC
6 01/17/2002    AQS 60010007   1                           41.0 ug/m3 LC
  DAILY_AQI_VALUE Site.Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
1              78 Livermore               1              100              88101
2              92 Livermore               1              100              88101
3              71 Livermore               1              100              88101
4              80 Livermore               1              100              88101
5              98 Livermore               1              100              88101
6             115 Livermore               1              100              88101
        AQS_PARAMETER_DESC CBSA_CODE                         CBSA_NAME
1 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
2 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
3 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
4 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
5 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
6 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
  STATE_CODE      STATE COUNTY_CODE  COUNTY SITE_LATITUDE SITE_LONGITUDE
1          6 California           1 Alameda      37.68753      -121.7842
2          6 California           1 Alameda      37.68753      -121.7842
3          6 California           1 Alameda      37.68753      -121.7842
4          6 California           1 Alameda      37.68753      -121.7842
5          6 California           1 Alameda      37.68753      -121.7842
6          6 California           1 Alameda      37.68753      -121.7842
Code
head(new)
        Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    UNITS
1 01/01/2022    AQS 60010007   3                           12.7 ug/m3 LC
2 01/02/2022    AQS 60010007   3                           13.9 ug/m3 LC
3 01/03/2022    AQS 60010007   3                            7.1 ug/m3 LC
4 01/04/2022    AQS 60010007   3                            3.7 ug/m3 LC
5 01/05/2022    AQS 60010007   3                            4.2 ug/m3 LC
6 01/06/2022    AQS 60010007   3                            3.8 ug/m3 LC
  DAILY_AQI_VALUE Site.Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
1              52 Livermore               1              100              88101
2              55 Livermore               1              100              88101
3              30 Livermore               1              100              88101
4              15 Livermore               1              100              88101
5              18 Livermore               1              100              88101
6              16 Livermore               1              100              88101
        AQS_PARAMETER_DESC CBSA_CODE                         CBSA_NAME
1 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
2 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
3 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
4 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
5 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
6 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
  STATE_CODE      STATE COUNTY_CODE  COUNTY SITE_LATITUDE SITE_LONGITUDE
1          6 California           1 Alameda      37.68753      -121.7842
2          6 California           1 Alameda      37.68753      -121.7842
3          6 California           1 Alameda      37.68753      -121.7842
4          6 California           1 Alameda      37.68753      -121.7842
5          6 California           1 Alameda      37.68753      -121.7842
6          6 California           1 Alameda      37.68753      -121.7842

Step 2

Code
old_no_NA <- na.omit(old)
new_no_NA <- na.omit(new)

all <- rbind(old_no_NA, new_no_NA)
Code
names(all)[names(all) == "Daily.Mean.PM2.5.Concentration"] <- "PM"
names(all)[names(all) == "Daily_OBS_COUNT"] <- "Obs#"
names(all)[names(all) == "AQS_PARAMETER_CODE"] <- "AQSC"
names(all)[names(all) == "AQS_PARAMETER_DESC"] <- "AQS_DESC"
names(all)[names(all) == "DAILY_AQI_VALUE"] <- "AQI_V"
names(all)[names(all) == "SITE_LATITUDE"] <- "LAT"
names(all)[names(all) == "SITE_LONGITUDE"] <- "LON"
Code
max(all$PM2.5M)
Warning in max(all$PM2.5M): no non-missing arguments to max; returning -Inf
[1] -Inf
Code
all$year <- format(as.Date(all$Date, format="%m/%d/%Y"),"%Y")

all$month <- format(as.Date(all$Date, format="%m/%d/%Y"),"%m")

Step 3

Code
library(leaflet)

PM.pal <- colorFactor(
  palette = c('orange', 'purple'),
  domain = all$year)

leaflet(all) %>% 
  addProviderTiles('CartoDB.Positron') %>% 
  addCircles(lat = ~LAT, lng = ~LON, opacity = 1, fillOpacity = 1, radius = 40)
Code
leaflet(all) %>%
  addTiles() %>%
  addCircles(lng = ~LON, lat = ~LAT, 
             opacity = 0.5, fillOpacity = 0, popup = year, radius = 10, color = ~PM.pal(year)) %>% 
  addLegend('bottomleft', pal=PM.pal, values=all$year,
          title='Location by Year', opacity=1)

Step 4 + 5:

County Comparisons

Code
all %>%
  ggplot() + 
  geom_bar(mapping = aes(x = COUNTY, fill = year)) +
  scale_fill_viridis_d() + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(y = "Count", title = "Barchart of Site Measurement Count by County")

Code
all %>%
  ggplot() + 
  geom_point(mapping = aes(x = COUNTY, y = PERCENT_COMPLETE, color = year)) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Percent complete status of CA county sites")

Code
all %>% 
  ggplot() + 
  geom_point(mapping = aes(x = COUNTY, y = PM, color = year)) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "PM2.5 by California state COUNTY")

Code
state.anova <- aov(PM ~ COUNTY + year, data = all)
summary(state.anova)
               Df  Sum Sq Mean Sq F value Pr(>F)    
COUNTY         41  553026   13488   174.5 <2e-16 ***
year            1  636394  636394  8231.4 <2e-16 ***
Residuals   68018 5258681      77                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
wilcox.test(PM~year, data = all)

    Wilcoxon rank sum test with continuity correction

data:  PM by year
W = 565899768, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
Code
ggplot(all, aes(x=year, y=PM)) +
  geom_boxplot(outlier.colour=NA) +
  coord_cartesian(ylim = c(0, 100)) +
  xlab(label = "YEAR") +
  ylab(label = "Mean PM2.5") +
  labs(title = "PM Value Distribution by Year")

Conclusions:

  • It appears that the mean PM2.5 emissions are lower in 2022 than they were in 2002 when we look at the boxplot of mean PM2.5 by year.

    • This is supported by the results of the one way anova on mean PM2.5 emission levels by year (statistically significant difference between the two).

      • This is also supported by the results of the 2-way ANOVA on the PM2.5 emission by county –> the mean PM2.5 values are statistically significantly different between the two years.
  • Los Angeles appears to contribute the most to the PM2.5 emission values based on the graph of observation counts by county

Within LA

Code
LA <- all[all$COUNTY == "Los Angeles",]

LA %>%
  ggplot() + 
  geom_point(mapping = aes(x = Site.Name, y = PERCENT_COMPLETE, fill = year)) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Completion Status (%) of LA County Sites")

Code
LA %>%
  ggplot() + 
  geom_bar(mapping = aes(x = Site.Name, fill = year), position = "dodge") +
  scale_fill_viridis_d() + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(y = "Count", title = "Barchart of Site Measurement Count by Site in LA COunty")

Code
ggplot(LA, aes(x=year, y=PM)) +
  geom_boxplot(outlier.colour=NA) +
  coord_cartesian(ylim = c(0, 100)) +
  xlab(label = "YEAR") +
  ylab(label = "Mean PM2.5") +
  labs(title = "PM2.5 Value Distribution for all LA Sites by Year")

Code
ggplot(LA, aes(group= Site.Name, y=PM)) +
  geom_boxplot(outlier.colour=NA) +
  coord_cartesian(ylim = c(0, 100)) +
  ylab(label = "Mean PM2.5") +
  labs(title = "PM2.5 Distribution per LA Site by Year") +
  facet_wrap(~year)

Code
ggplot(LA, aes(x=year, y=AQI_V)) +
  geom_boxplot(outlier.colour=NA) +
  coord_cartesian(ylim = c(0, 100)) +
  xlab(label = "YEAR") +
  ylab(label = "AQI") +
  labs(title = "AQI Value Parameters for all LA Sites by Year")

Code
LA %>% 
  ggplot(aes(x = month, y = PM, fill = year)) +
  stat_summary(fun.data = "mean_sdl", geom = "bar", position = "dodge") +
  stat_summary(fun.data = "mean_sdl", geom = "errorbar", position = "dodge", width = 0.5) +
  scale_fill_viridis_d() + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Mean PM2.5 by Month within LA")

Code
wilcox.test(PM~year, data = LA)

    Wilcoxon rank sum test with continuity correction

data:  PM by year
W = 7158362, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Conclusions:

  • According to the results of the Mannu, there is a statistically significant difference in the mean PM2.5 emission levels in LA betwen 2002 and 2022

    • looking closer at the data, (Mean PM2.5 by month, and boxplot of mean PM2.5 per year) it appears that the mean PM2.5 in 2002 was around 22 and dropped to around 12.5 in 2022.

    • Looking at the confidence intervals of the boxplot of mean PM2.5 per year and mean PM2.5 per year but by month, it seems that there was much more variation in data observations in 2002. This is also evident looking at the observation count by LA site- less sites contributed complete observations in 2002 compared to 2022.

Month-Month (Statewide)

Code
all %>% 
  ggplot(aes(x = month, y = PM, fill = year)) +
  stat_summary(fun.data = "mean_sdl", geom = "bar", position = "dodge") +
  stat_summary(fun.data = "mean_sdl", geom = "errorbar", position = "dodge", width = 0.5) +
  scale_fill_viridis_d() + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Mean PM2.5 by Month within California")

Code
ggplot(data = all) +
  geom_point(mapping = aes(x = month, y = AQI_V, color = year), position = "jitter")+
  geom_smooth(mapping = aes(x = month, y = PM, color = year)) +
  labs(title = "AQI Value by Month within California")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Code
month.anova <- aov(PM~month + year, data = all)
summary(month.anova)
               Df  Sum Sq Mean Sq F value Pr(>F)    
month          11  230303   20937   260.7 <2e-16 ***
year            1  753342  753342  9381.2 <2e-16 ***
Residuals   68048 5464456      80                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusions:

  • Looking by month, we also see that the mean PM2.5 values in each month of 2002 were lower than their corresponding month in 2022. However, there is more variation in the values in 2002 (as evident by the confidence intervals).

Answer:

It seems that overall the mean PM2.5 levels were lower in 2022 than they were in 2022 when we look at the state wide, and LA wide data by year and by month. However owing to the greater variation in data in 2002 compared to 2022, it is also possible that the overall admission data for 2002 may have changed had more sites contributed.